STA 101 - Summer I 2022
Raphael Morsomme
Breakdown of variables into their respective types.
Source: IMS
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey
Group exercise - data summaries?
There are 3,652 observations (rows) and 6 variables (columns).
# A tibble: 6 x 6
year month date_of_month date day_of_week births
<int> <int> <int> <date> <ord> <int>
1 1994 1 1 1994-01-01 Sat 8096
2 1994 1 2 1994-01-02 Sun 7772
3 1994 1 3 1994-01-03 Mon 10142
4 1994 1 4 1994-01-04 Tues 11248
5 1994 1 5 1994-01-05 Wed 11053
6 1994 1 6 1994-01-06 Thurs 11406
We can change the number of bins to have a rougher or more detailed histogram.
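As a minimal sketch of how the number of bins changes the picture, the base R code below bins the same values coarsely and finely. The data are simulated (a hypothetical two-group mixture, not the actual birth counts), and the names `x` and `bin_counts` are illustrative only.

```r
# Simulated stand-in data (NOT the real birth counts): two bell-shaped groups
set.seed(1)
x <- c(rnorm(300, mean = 8500, sd = 400), rnorm(700, mean = 12000, sd = 500))

# Count the observations falling in k equal-width bins spanning the data
bin_counts <- function(x, k) {
  breaks <- seq(min(x), max(x), length.out = k + 1)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

coarse <- bin_counts(x, 5)   # rough picture: few wide bins
fine   <- bin_counts(x, 40)  # detailed picture: many narrow bins

print(coarse)
print(fine)
```

With few bins the two groups may blur together; with many bins the shape is more detailed but also noisier.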
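It is always a good idea to make a histogram of continuous variables. To describe a distribution, we comment on its number of modes, its shape (symmetric or skewed, bell-shaped or not) and the presence of extreme values.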
Note
Some distributions will not fit nicely in these categories.
The distribution of the daily number of births in the US is bimodal, with each mode being bell-shaped and symmetric. We observe no extreme values.
Group exercise - describing a distribution
Exercise 5.10
# A tibble: 6 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
We will look at the relation between engine size (displ) and fuel efficiency (hwy).
Scatterplots are used to visualize the relation between two numerical variables.
Note
To add an additional variable to your visualization, you can use color or symbols.
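A minimal sketch of this idea, using base R graphics and only the six rows of the car data printed above: the number of cylinders (`cyl`) is mapped to both a color and a plotting symbol. The color names and symbol codes are arbitrary choices for illustration.

```r
# The six rows of the car data printed above
d <- data.frame(
  displ = c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8),
  hwy   = c(29, 29, 31, 30, 26, 26),
  cyl   = c(4, 4, 4, 4, 6, 6)
)

# Map the third variable (cyl) to both a color and a plotting symbol
groups    <- factor(d$cyl)
point_col <- c("steelblue", "firebrick")[as.integer(groups)]
point_pch <- c(16, 17)[as.integer(groups)]

plot(d$displ, d$hwy,
     col = point_col, pch = point_pch,
     xlab = "Engine size (displ)", ylab = "Highway fuel efficiency (hwy)")
legend("topright", legend = levels(groups),
       col = c("steelblue", "firebrick"), pch = c(16, 17), title = "cyl")
```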
The average: \(\bar{x} = \dfrac{x_1 + \dots + x_n}{n}\)
The median: the middle value
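As a small worked example, the sketch below computes both summaries for the six daily birth counts printed earlier (the first six days of 1994). The variable names are illustrative.

```r
# Births on the first six days of 1994, as printed earlier
births <- c(8096, 7772, 10142, 11248, 11053, 11406)

avg <- sum(births) / length(births)  # the formula above; same as mean(births)
med <- median(births)                # with an even n, the average of the two middle values

print(avg)
print(med)
```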
Percentiles are a generalization of the median.
Since the median is larger than 50% of the data and smaller than the rest, it is called the 50th percentile.
Similarly, the value that is larger than p% of the data and smaller than the rest is called the p-th percentile.
We will soon make use of the 25th and 75th percentiles.
Later in the course, the 95th and 97.5th percentiles will also be useful.
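A minimal sketch of computing percentiles in R with `quantile()`, again on the six birth counts printed earlier. Note that by default `quantile()` interpolates between neighbouring observations, so on a small sample its output can differ slightly from the simple definition given above.

```r
# Births on the first six days of 1994, as printed earlier
births <- c(8096, 7772, 10142, 11248, 11053, 11406)

# 25th, 50th and 75th percentiles (default interpolation method)
q <- quantile(births, probs = c(0.25, 0.50, 0.75))
print(q)
```

The 50th percentile returned here coincides with `median(births)`.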
Real-world data often contain extreme values: measurement errors, typos, …
The average, median, variance, sd and iqr are not equally robust to the presence of extreme values.
Let us contaminate the birth data with a value of 1 billion…
…and compare the mean, median, variance, sd and iqr of these two variables
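Summary of the original births:
Min. 1st Qu. Median Mean 3rd Qu. Max.
6443 8844 11615 10877 12274 14540
Summary of the contaminated births:
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.443e+03 8.845e+03 1.162e+04 2.846e+05 1.228e+04 1.000e+09
Variance (original, then contaminated):
[1] 3454270
[1] 2.737417e+14
Standard deviation (original, then contaminated):
[1] 1858.567
[1] 16545140
IQR (original, then contaminated):
[1] 3429.75
[1] 3430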
Robustness of the median and the iqr
While the median and iqr are robust to the presence of extreme values, the mean, the variance and the sd are not.
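The same experiment can be sketched on a tiny scale. The code below takes the six birth counts printed earlier, appends a single value of 1 billion, and compares the five summaries side by side. The helper `summarise_stats` is a hypothetical name introduced for this illustration.

```r
# Births on the first six days of 1994, as printed earlier
births <- c(8096, 7772, 10142, 11248, 11053, 11406)

# Contaminate the data with one extreme value
births_contaminated <- c(births, 1e9)

# Hypothetical helper: the five summaries discussed above
summarise_stats <- function(x) {
  c(mean = mean(x), median = median(x),
    variance = var(x), sd = sd(x), iqr = IQR(x))
}

comparison <- rbind(original     = summarise_stats(births),
                    contaminated = summarise_stats(births_contaminated))
print(comparison)
```

The median and iqr columns should barely move between the two rows, while the mean, variance and sd change by several orders of magnitude.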
Group exercise - summary statistics
Exercises 5.8, 5.11, 5.15 (replace part \(c\) by height of adults)
Note: Q1 is the 25th percentile (larger than one quarter of the data), and Q3 is the 75th percentile.
Group exercise - contingency and proportion tables
Group exercise - pros and cons of barplots
Exercise 4.5
✅ Combines the strengths of the various barplots
🛑 Not in the toolbox of every data scientist
Source: R4DS, Wickham and Grolemund
The thick line in the middle of the box indicates the median
The box stretches from the 25th percentile to the 75th percentile; it covers 50% of the data.
The length of each whisker is at most 1.5 iqr.
Any observation more than 1.5 iqr away from the box is labelled as an outlier.
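The boxplot rules above can be sketched by hand. The example below uses the six birth counts printed earlier plus one hypothetical extreme value (30000, made up for illustration), computes the box and the 1.5 iqr limits, and flags the outliers. R's own `boxplot()` computes the box edges in a slightly different way, so its result can differ a little on small samples.

```r
# Six birth counts printed earlier plus one HYPOTHETICAL extreme value (30000)
x <- c(8096, 7772, 10142, 11248, 11053, 11406, 30000)

q1  <- unname(quantile(x, 0.25))  # lower edge of the box
q3  <- unname(quantile(x, 0.75))  # upper edge of the box
iqr <- q3 - q1                    # length of the box

# Whiskers may extend at most 1.5 iqr beyond the box
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Observations beyond the limits are labelled as outliers
outliers <- x[x < lower_limit | x > upper_limit]

# Whiskers stop at the most extreme observations that are not outliers
whisker_ends <- range(x[x >= lower_limit & x <= upper_limit])

print(outliers)
print(whisker_ends)
```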
Outliers
Outliers have an extreme value. How to deal with an outlier depends on why the observation stands out.
Group exercise - types of associations
Exercise 5.13
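library(kableExtra)  # kbl() and kable_classic(); also re-exports the %>% pipe

# Row proportions: distribution of drive type (drv) within each class of car
table(d_car$class, d_car$drv) %>%
  prop.table(1) %>%
  round(2) %>%
  kbl(caption = "Distribution of drive type per class of car") %>%
  kable_classic(c("striped", "hover"), full_width = FALSE)

| | 4 | f | r |
|---|---|---|---|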
| 2seater | 0.00 | 0.00 | 1.00 |
| compact | 0.26 | 0.74 | 0.00 |
| midsize | 0.07 | 0.93 | 0.00 |
| minivan | 0.00 | 1.00 | 0.00 |
| pickup | 1.00 | 0.00 | 0.00 |
| subcompact | 0.11 | 0.63 | 0.26 |
| suv | 0.82 | 0.00 | 0.18 |
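The `prop.table(1)` call in the code above normalizes each row, which is why every row of the table sums to 1. As a minimal sketch of the difference between row and column proportions, the example below builds a contingency table of `year` by `cyl` from the six rows of the car data printed earlier (a tiny subset chosen for illustration, not the full data).

```r
# year and cyl for the six rows of the car data printed earlier
d <- data.frame(
  year = c(1999, 1999, 2008, 2008, 1999, 1999),
  cyl  = c(4, 4, 4, 4, 6, 6)
)

counts    <- table(d$year, d$cyl)    # contingency table of counts
row_props <- prop.table(counts, 1)   # proportions within each row (year)
col_props <- prop.table(counts, 2)   # proportions within each column (cyl)

print(counts)
print(row_props)
print(col_props)
```

Row proportions describe the distribution of `cyl` within each year, while column proportions describe the distribution of `year` within each number of cylinders.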
📋 See this vignette for more details on editing tables
Have a purpose: is the figure necessary?
Parsimony: keep it simple and avoid distractions
Tell a story: provide context and interpret the figure
Show \(\ge 3\) variables whenever possible: use color, facets, etc.
Edit your figure: title, axes, etc.
📋 See R for Data Science - chapters 3 and 7 for more on data visualization in R.
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey